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1 Introduction 

The static analysis of control/command programs, with a view to proving the 
absence of runtime errors, has recently picked up steam, with the inception 
of analyzers capable of scaling up to real industrial programs. In particular, 
it is nowadays possible to build sound and precise static analyzers scaling up 
to realistic industrial situations. A static analyzer takes as input a program 
(source code or object code) and outputs a series of facts, warnings or other 
indications obtained by automatic analysis of that program. 

A static analyzer is said to be sound if all the facts that it derives from 
a program (say, "variable x is always positive") are always true, regardless 
of how and on which inputs the program is run. Sound static analyzers are 
based on a semantics, that is, a mathematical definition of possible program 
executions. 

It is well-known that any method for program verification cannot be 
at the same time sound (all results produced are truthful), automatic (no 
human intervention), complete (true results can always be proved) and ter- 
minating (always produces a result) Q unless one supposes that program 
memory is finite and thus that the system is available to mo del- checking 
techniques. As a result, sound static analyzers are bound to produce false 
alarms sometimes; that is, warnings about issues that cannot happen in re- 
ality. One thus wants analyzers that are precise, that is, model reality so 
closely that they seldom produce false alarms — but also, one wants analyz- 
ers that are efficient, taking only reasonable amounts of time and memory 
to perform an analysis. 

One crucial class of errors for control/command systems is arithmetic 
overflows — say, when converting some value to an integer — in programs 
using floating-point computations. Such errors have already proved to be 
extremely dangerous, having for instance caused the explosion of the Ariane 5 



^The formal version of this result is a classic of recursion theory, known as Rice's 
theorem. 
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on its maiden flight [16]. In order to prove the absence of such errors, static 
analyzers such as Astre^ [TJ [2] have to bound all floating-point variables in 
the program. It is impossible to do so using simple interval arithmetic; in 
order to bound the output of a numerical filter, one has to make the analyzer 
understand the stability conditions of the numerical processing implemented 
in the application to be analyzed. 

In current control/command designs, it is commonplace that the exe- 
cutable is obtained by compiling C code, or assembly code, itself obtained 
by automatic translation from a high-level specification. This high-level spec- 
ification is typically given in a high-level language such as Simuhnlil Lustre 
[H or Scade™j4| 

These languages, in their simplest form, consider programs 
to be the software counterpart of a network of electronic circuits (filters, 
integrators, rate limiters...) connected by wires; this is actually how several 
of these languages represents programs graphically. Several circuits can be 
grouped into a compound filter. 

One advantage of these high-level languages is that their semantics is 
considerably cleaner than those of low-level languages such as C. The filter 
and compound filter constructions provide natural "boundaries" for blocks 
of computations that belong together and probably have some interesting 
and identifiable properties. It is thus interesting to be able to analyze these 
languages in a compositional and modular fashion; that is, the analysis of 
some block (compound filter) is done independently of that of the rest of the 
code, and the result of that analysis may be "plugged in" when analyzing 
larger programs. 

This paper deals with the compositional and modular analysis of lin- 
ear filters. By this, we mean filters that would be linear had they been 
implemented over the real field. Of course, in reality, these filters are imple- 
mented over floating-point numbers and none of the classical mathematical 
relationships hold. We nevertheless provide sound semantics for floating- 
point computations and sound analysis for such filters. 

1.1 Digital filtering 

Control/command programs in embedded applications often make use of 
linear filters (for instance, low-pass, high-pass, etc.). The design principles 
of these filters over the real numbers are well known; standard basic designs 
(Butterworth, Chebyshev, etc.) and standard assembly techniques (parallel, 
serial) are taught in the curriculum of signal processing engineers. Ample 

^http : // www ■ astree ■ ens . fr] 

^Simulink^'^is a tool for modelling dynamic systems and control applications, using 
e.g. networks of numeric filters. The control part may then be compiled to hardware or 
software. 

TT/www.mathworks . com/products/simulink/| 
Scade is a commercial product based on LUSTRE. 
TT/www. est erel- technologies . com/product s/s cade- suite/ | 
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literature has been devoted to the design of digital filters implementing some 
desirable response, for implementation in silicon or in software, in fixed-point 
and in floating-point. [T2] 

However, discrete-time filters are often discussed assuming computations 
on real numbers. There is still some considerable literature on the implica- 
tions of fixed-point or floating-point numbers, but the vast majority of the 
work has focused on "usual case" or "average case" bounds — it is even argued 
that worst-case bounds on ideal filters on real numbers are too pessimistic 
and not relevant for filter design [121 §11-3]. The study of the quantization 
and roundoff noise generated by fixed-point or floating-point implementa- 
tions has mostly been done from a stochastic point of view, in order to prove 
average case properties. 

For our analysis purposes, we need sound worst-case bounds, and practi- 
cal means for obtaining them with reasonable computational resources. For 
these reasons, the point of view of the designers of static analyzers is different 
from that of the filter designers. 

A favorite tool of filter designers is the Z-transform [121 chapter 3], with 
which the overall ideal (i.e. implemented over the real numbers) transfer 
function of a filter is summarized in a rational function with real coefficients, 
whose poles and zeroes determine the frequency response. In this paper, 
we shall show how we can use this transform to automatically summarize 
networks of linear filters; how this transform allows us to compute precise 
bounds on the outcome of the filter, and to statically summarize complex 
filters; and how to deal with roundoff errors arising from floating-point com- 
putations. 

1.2 Contributions of the article 

This article gives a sound abstract semantics for linear numerical filters im- 
plemented in floating-point or fixed-point arithmetics, given as the sum of a 
linear part (using the Z-transform) and a nonlinear part (given using affine 
bounds); this latter part comes from the roundoff noise (and, possibly, some 
optional losses of linear precision done for the sake of the speed of the anal- 
ysis). (Sect.Hlfor the ideal, linear part, [7] for the nonlinear part). 

In many occasions, the computed bounds are obtained from the norms 
(Sect. 12. 3|) of certain power series. In Sect. [U we give effective methods on 
the real numbers for bounding such norms. In SectE] we explain how to 
implement some of these methods efficiently and soundly using integer and 
floating-point arithmetics. In Sect. Owe study a few cases. 

As with other numerical domains such as those developed for Astree, we 
proceed as follows: the exact floating-point concrete semantics is overap- 
proximated by a mathematically simple semantics on real numbers, which 
is itself over approximated by proved bounds, which are themselves further 
over approximated by an executable semantics (implemented partly in exact 
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feedback sub-filter compositional blocks 

Figure 1: Decomposition of the TF2 filter Sn = aoEn + aiEn-i + a2-E'n-2 
+ PiSn-i + (32Sn-2 into elementary blocks. The compositional blocks are 
chained by serial composition. Inside each compositional on the left, elemen- 
tary gates are composed in parallel. On the right hand side, a feedback loop 
is used. 



arithmetics, partly using some variant of interval floating-point computa- 
tions). This ensures the soundness of the effective computations. 
This paper is an extended version of [18| . 

1.3 Introduction to linear filters and Z-transforms 

Let us consider the following piece of C code, which we will use as a running 
example (called "TF2"): 

Y = AO*I + Al*Ibuf[l] + A2*Ibuf[2]; 
= Y + Bl*Obuf[l] + B2*0buf[2]; 
Ibuf [2]=Ibuf [1] ; Ibuf[l]=I; 
Obuf [2]=0buf [1] ; Obuf[l]=0; 

All variables are assumed to be real numbers (we shall explain in later 
sections how to deal with fixed- and floating-point values with full generality 
and soundness). The program takes I as an input and outputs 0; AO etc. are 
constant coefficients. This piece of code is wrapped inside a (reactive) loop; 
the time is the number of iterations of that loop. Equivalently, this filter can 
be represented by the block diagram in Fig. [H 

Let us note oq etc. the values of the constants and in (resp. y„, o„) the 
value of I (resp. Y, 0) at time n. Then, assuming = for A; < 0, we can 
develop the recurrence: On = y„ + 6i.o„_i -F62-On-2 = yn + h-iyn~i+bi-On-2 

+ b2.0n-3)+b2-{yn~2 + bl-On-3 + b2-On-i) = ^n + ^l -yn-l + (^'2 +&f ^0) •yn~2 + • • • 

where . . . depends solely on yk with k < n — 2. More generally: there 
exist coefhcients cq, ci...such that for all n, o„ = 'Ylik=o^kyn-k- These 
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coefficients solely depend on the bk] we shall see later some general formulas 
for computing them. 

But, itself, Un = CLQ.in + ai.in-i + a2.in-2- It follows that there exist 
coefficients (depending on the and the 6^) such that o„ = 'Ylik=o "^fc^n-fc- 
We again find a similar shape of formula, known as a convolution product. 
The sequence is called a convolution kernel, mapping i to o. 

Let us now suppose that we know a bound Mj on the input: for all n, 
\in\ < Mj; we wish to derive a bound Mq on the output. By the triangle 
inequality, \On\ < Ylik=o\^k\--^i ■ "^^^ quantity X^;^.=o kfcl called the 11- 
norm of the convolution kernel c'. 

What our method does is as follows: from the description of a complex 
linear filter, it compositionally computes compact, finite representations of 
convolution kernels mapping the inputs to the outputs of the sub-blocks of 
the filter, and accurately computes the norms of these kernels (or rather, 
a close upper bound thereof). As a result, one can obtain bounds on any 
variable in the system from a bound on the input. 

2 Linear filters 

In this section, we give a rough outline of what we designate by linear filters 
and how their basic properties allow them to be analyzed. 

2.1 Notion of filters 

We deal with numerical filters that take as inputs and output some (un- 
bounded) discrete streams of fioating-point numbers, with causality; that is, 
the output of the filter at time t depends on the past and present inputs 
(times to t), but not on the future inputsU In practice, they are imple- 
mented with a state, and the output at time t is a function of the input at 
time t and the internal state, which is updated. Such filters are typically 
implemented as one piece of a synchronous reactive loop [2l §4]: 

while (true) { 

(state, output) = filter (state , input); 

} 

'There exist non-causal numerical filtering techniques One striking example is Mat- 
lab's filtfilt function, which runs the same causal filter in one direction, then in the 
reverse-time direction over the same signal; the overall filter has zero phase shift at all 
frequencies, a very desirable characteristic in some applications. Unfortunately, as seen 
on this example, non-causal filters require buffering the signal and thus are not usable for 
real-time applications. They are outside the scope of this paper. 
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2.2 Linear filters 



We are particular interested in filters of the following form (or compounds 
thereof): if (sk) and (e^) are respectively the input and output streams of 
the filter, there exist real coefficients oq, ai, . . . a™ and ... Pm such that 
for all time t, st (the output at time t) is defined as: 



Si = ^ akCt-k + ^ PkSt-k 

k=0 k=l 



(1) 



or, to make apparent the state variables. 
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et^n 



(2) 



If the (3 are all null, the filter has necessarily finite impulsional response 
(FIR) while in the opposite case, it may have infinite impulsional response 
(IIR). The reason for this terminology is the study of the reaction of the 
system to a unit impulse (cq = 1 and V/e > = 0). In the case of a 
FIR filter, re + 1 time units after the end of the impulse, the output becomes 
permanently null. In the case of an IIR filter, the output (when computed 
ideally in the real numbers) never becomes permanently null, but rather 
follows some exponential decay if the filter is stable. A badly designed IIR 
filter may be unstable. Furthermore, it is possible to design filters that 
should be stable, assuming the use of real numbers in computation, but that 
exhibit gross numerical distortions due to the use of floating-point numbers 
in the implementation. 

Linear filters are generally noted using their Z-transforn^ 



aQ + aiz-\ h anz'^' 



1 - Piz 



(3) 



The reasons for this notation will be made clear in Sect. 14.51 In particular, 
all the ideal compound linear filters expressible with elementary elements 
such as products by constants, delays, etc. can be summarized by their Z- 
transform (Sect. that is, they are equivalent to a filter whose output is 
a linear combination of the last n inputs and m outputs. The Z-transform 
will also be central in the semantics of floating-point and fixed-point filters 
(Sect.CJ. 



^An alternate notation [12] replaces all occurrences of z by z^^ . In such a formalism, 
conditions such as "the poles must have a module greater than 1" are replaced by the 
equivalent for the inverse, e.g. "the poles must have a module strictly less than 1". We 
chose polynomials in z because they allow using normal power series instead of Laurent 
series. 
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To summarize some salient points of the following sections, FIR filters 
given by a's are very easy to deal with for our purposes, while the stability 
and decay conditions of IIR filters are determined by the study of the above 
rational function and especially the module of the zeroes of the Q{z) = 
1 — Piz — ■ ■ ■ — Prnz"^ polynomial (zq is a zero of Q if Q{zo) = 0). Those 
roots are the inverses of the eigenvalues of the transition matrix. Specifically, 
the filter is stable if all the zeroes have module greater than 1. 

2.3 Bounding the response of the filter 

The output streams of a linear filter, as an element of M^, are linear functions 
of the inputs and the initial values of the state variables (internal state 
variables). 

More precisely, we shall see later that, neglecting the floating-point errors 
and assuming zero in the initial state variables, the output S is the convolu- 
tion product Q-kE of the input E by some convolution kernel Q: there exists 
a sequence ((?n)neM of reals such that for any n, Sn = Yl^=Qlk^n-k- The 
filter is FIR if this convolution kernel is null except for the first few values, 
and IIR otherwise. If the initial state values ri, . . . , r„ are nonzero, then 
S = Qa-k E + riQi + rnQn where the Qi are convolution kernels. 

Let E : {ek)n€N be a sequence of real or complex numbers. We call Leo- 
norm of E, if finite, and note ||£'||oo the quantity sup^^gp^ \ ek\- Because of the 
isomorphism between sequences and formal power series, we shall likewise 
note II J^k^kz'^Woo = sup;. |afc|. We are interested in bounding the response 
of the filter with respect to the infinite norm: i.e. we want to construct a 
function / such that \\S\\oo < /(||i?||oo)- Said otherwise, if for all the past of 
the computation since the last reset of the filter, |e| was less than M, then 
has |s| has been always less than /(|M|) since the last reset. 

If we do not have initialization conditions nor floating-point errors, / will 
be linear, otherwise it will be affine. Let us place ourselves for now in the 
former case: we are trying to find a number g such that USHoo < 5-||-E'||oo. 
For any linear function / mapping sequences to sequences, we call subordinate 
infinite norm of /, noted, ||/||oo the quantity sup||2,||^=]^ ||/(2;)||oo, assuming 
is is finite. We are thus interested in g = \\E Q * E\\oo. If this quantity is 
finite, the filter is stable; if it is not, it is unstable: it is possible to feed an 
input sequence to the filter, finitely bounded, which we result in arbitrarily 
high outputs at some point in time. 

For a sequence (or formal series) A, we note \\A\\i = YlT=o \^k\, called its 
Ll-norm, if finite. Then we have the following crucial and well-known result 
[m §11.3]: 

Lemma 1. H-E ^ Q * -EHoo = ||Q||i- 

Proof. We shall first prove that \\E h-> Q -k E\\oo < ||Q||i; that is, for any 
sequences Q and E, \\Q * E\\oo < ||(5||i.||£^||oo- Let us note C = Q * E. 
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= YJk=Q<lken-k, therefore |c„| < YJl=Q\qk\\en-k\ < l|e||oo- Z]fc=o kfcl < 
l|e||oo-||Q||i- 

We shall then show equality. Let M < \\0\\i. Recall that ||Q||oo = 
Sfc^ol^'sl- Then there exists N such that X^^qI^^I — Choose 6^ = 1 
if k < N and Qn-k > 0, Cfc = —1 otherwise. Clearly, ||i?||oo = 1, and 

= Y^k=o^kqn~k = T.k=o IQti- k\ > M, therefore \\Q * E\\oo > M and 
\\E I— > Q * E\\oo > M. Since this is valid for any M < \\Q\\i, then the 
\\E Q -k E\\oc = \\Q\\i equality holds. □ 

Note that most of the discussion on numerical filters found in the signal 
processing literated is based on the L2-norm ||x||2 = (X^fcLo l^fcP)^'^^ (which 
is adapted to energy considerations) — for instance, for estimating the fre- 
quency spectrum of the rounding noise. We shall never use this norm in this 
article. 

3 Convolution kernels as formal power series 

In the preceding section, we said that the output of the ideal filter is just the 
convolution of the input with some (possibly infinite) kernel. In this section, 
we show how formal power series are a good framework for describing this 
convolution, and basic facts about the kernels of the filters we are interested, 
given as rational functions. 

3.1 Formal power series 

We shall first recall a few definitions and facts about formal power series. 
The algebra formal power series over a field if = M or C is the vector 

space of countably infinite sequences where the product of two sequences 
A : (afc)fcgN and B : {bk)k€N is defined as A.B : {ck)keN by, for all n G N, 
Cn = Yl^=o'^kbn~k (convolution) . Remark that for any algebra operation 
(addition, subtraction, multiplication) and any N, we obtain the same re- 
sults for the coefficients c„ for n < as if A and B were the coefficients of 
polynomials and we were computing the coefficient c„, the n-th degree coef- 
ficient of the polynomial For this reason, we shall from now on note 
^{^) — SfcLo '^kZ^ by analogy with the polynomials. Note that for most 
of this article, we are interested in formal power series and not with their 
possible interpretation as holomorphic functions (i.e. it is not a problem at 
all if the convergence radius of the Ylk=o^kZ^ series is null); we shall note 
the rare occasions when we need convergence properties (and we shall prove 
the needed convergences). If all the are null except for a finite number, 
the formal series ^ is a polynomial. 

'^One can therefore see as the projective limit of the K[X]/ X" quotient rings 

with the canonical K[X]/ X'^^'^ K[X]/X" morphisms in the category of rings. 
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Wherever we have a convolution (at) * (bk) of sequences, we can equiva- 
lently consider a product A.B of formal series. 

We shall often wish to take the inverse of a power series, and the quotient 
A/B of two series. This is possible for any series '}2ik^kb'^ such that 60 is 
not null. We define a sequence of series as follows: = A, ^("+1) = 
- g„ * z'^B where qn = a^n^ /bo- Note that for all n e N, fc < n A^"^ = 
and A = + ^^^0 Qkz'^B; thus for all n, ^ ee J^'^^q q^z'^B (mod X"), 

which may equivalently written as A = Q.B (mod X"). Therefore, ^4 = 
Q.B, which explains why Q can be called the quotient of A by i?. 

A very important case for the rest of the paper is 1/(1 — z) = YlT=o^^- 
Another important constatation is that this quotient formula applied to 

j;. "0 + aiz H h anz"- ... 

^-^■l-P.Z (5^zm 

where S and E are expressed as formal power series is equivalent to running 
the IIR filter defined by the above rational function with E the inputs and 
5 the output. 



3.2 Stability condition 

We manipulate convolution kernels expressed as rational functions where the 
coefficient of degree of the denominator is 1. We shall identify a rational 
function with the associated formal power series. Using complex analysis, we 
shall now prove the following lemma, giving the stability condition familiar 
to filter designers: 

Lemma 2. \\Q\\i < 00 if and only if all the poles of Q are outside of the 
\z\ < 1 unit disc. 

That is: a filter is stable in ideal real arithmetics if and only if all its 
poles have module greater than 1. 

Proof. Consider the poles of the rational function Q. If none are in the 
|z| < 1 unit disc, then the radius of convergence of the power series of 
the meromorphic function Q around has a radius of convergence strictly 
greater than 1. This implies that the series converges absolutely for z = 1 
and thus that ||Q||i is finite. On the other hand, if \\Q\\i < 00 then the series 
converges absolutely within the |z| < 1 unit disc and no pole can be within 
that disc. □ 



4 Compositional semantics: real field 

Now, we have a second look at the basic semantics of linear filters, in order 
to give a precise and compositional exact semantics of compound filters on 
the real numbers. We show that any linear filter with one input and one 
output is equivalent (on the real numbers) to a filter as defined in §2.2[ 
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4.1 Definition 



A filter or filter element has 

• rii inputs Ii, . . . , /„. (collectively, vector /), each of which is a stream 
of real numbers; 

• Tir reset state values ri, . . . , r^^ (collectively, vector R), which are the 
initial values of the state of the internal state variables of the filter 
(inside delay operators) at the beginning of the computation; 

• Uo output streams Oi, . . . , (collectively, vector O). 

If M is a matrix (resp. vector) of rational functions, or series, let Nx{M) 
be the coordinate- wise application of the norm || • to each rational function, 
or series, thereby providing a vector (resp. matrix) of nonnegative reals. We 
note niij the element in M at line i and column j. 

We note by W{z) the field of rational functions over M and by Q[z](2) the 
ring of rational functions of the form P{z)/{1 — zQ{z)) where P and Q are 
polynomials (that is, the ring of rational functions such that the constant 
term of the denominator is not null)Il When F G Q[2;](2), we note the 
Ll-norm of the associated power series. 

When computed upon the real field, a filter F is characterized by: 

• a matrix € ■M.no,ni{Q[z][z)) such that tij characterizes the linear 
response of output stream i with respect to input stream j; 

• a matrix G ■^no,nr{Q[^]{z)) such that dij characterizes the (decay- 
ing) linear response of output stream i with respect to reset value j. 

We note F{I, R) the vector of output streams of filter F over the reals, on 
the vector of input streams / and the vector of reset values R. Then we have 

V/ G (M^)''' yR G R"'- F{I, R) = .1 + .R (5) 

When the number of inputs and outputs is one, and initial values are 
assumed to be zero, the characterization of the filter is much simpler — 
all matrices and vectors are scalars (reals, formal power series or rational 
functions), and is null. We recommend that the reader instantiates our 
framework on this case for better initial understanding. 

4.2 Basic arithmetic blocks 

Plus node implemented in floating point type /: rij = rio = 1, 
T=[l 1], !) = []; 

^This last ring is the localization of the ring M.[z\ of real polynomials at the prime ideal 
{z) generated by z, thus the notation. 
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Figure 2: A feedback filter 

Scale by k node implemented in floating point type /: T = [fc] , Z) = []; 
Delay without initializer (delay for n clock ticks): T = [^'^] , D = 0; 
Unit delay with initializer : T = [z] , 1) = [l] ; 

4.3 Composition 
Parallel composition T = 

Serial composition through filter 1, then 2: 
T = T2.Ti,D=[T2.Di D2]. 

4.4 Feedback loops 

Let us consider a filter consisting of a filter F with m+n inputs and n outputs 
and feedback loops running the n outputs to the last n inputs through unit 
delays. (Fig. [2]) We split into sub-matrices T/ G M.n,m{'Q[z\{z)) and 
To G -A^n,n(QN(2)) representing respectively the responses to the global 
inputs and to the feedback loop. The system then verifies the linear equation 
over the vectors of formal power series: O = Tf .1 + zTq.P + D.R, and thus 
(Id„ - zT^)0 = Tf.I + D^.R. 

By Cor. H Id„ - zT^ is invertible in Mn,n{Q.[z\(z))^ thus T = (Id„ - 
zTq)~^ .T[ and D = {lAn. — zTq)~^ .D^ . Section [8?2] explains how to perform 
such computations in practice. 

4.5 Examples 

A second order IIR linear filter is expressed by 5 = a^.E + ai.delay2(£^) + 
a2.delay2(-E) + /3i.delay;^(S') + /32-delay2(5'). This yields an equation S = 
(ao + a\z + a2z'^)E + {[3iz + l32z'^)S. This equation is easily solved into 
5 = (ao + aiz + a2Z^){l - (5iz - P2z'^)~^-E. 

^ This result is not surprising, because the system, by construction, must admit causal 
solutions. 
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Figure 3: A compound filter consisting of two second order filters and a 
feedback loop. Each TF2 node is a second-order filter whose transfer function 
is of the form (ao + aiz + a2Z^){l — Piz — I32Z^)~^ . 

In Fig. m we first analyze the two internal second order IIR filters sepa- 
rately and obtain 

Ql = -j o -TT^ (6) 

1 - PiZ - j32Z'^ 

ao + aiz + a2Z^ 

(8) 

The we analyze the feedback loop and obtain for the whole filter a rational 
function with a 6th degree dominator: 

^= l + Tz^Q,.Q2^ 

where Qi and Q2 are the transfer function of the TF2 filters (form (oq + 
aiz + a2z'^){l — Piz — /322^)~^), which we computed earlier. 

5 Bounding the 1-norm of series expansions of ra- 
tional functions 

5.1 Inverses of products of afHne forms 

Let be complex numbers of module strictly greater than 1. Let Q{z) be the 
formal power series HILi Qi where the Qi{z) are the power series {z — 
The n-th degree coefficient of qi is by the easy expansion: 

q^^\ the coefficient of in the Q power series, is obtained by successive 
convolution products; it is 

«^"^= E n^f'^ (11) 
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We can therefore bound its module: 



Vi,fcieNAX;ifci=" 



(12) 



The right hand side of the preceding inequality is just the coefficient 



of the series nr=i Qi where g> 



(n) 



jg n-th order 



coefficient of the series. Since |^i| > 1, the convergence radius of 

this last series is strictly greater than 1; furthermore, all its coefficients are 
nonnegative; therefore, the sum of its coefficients is the value of the function 
at z = 1, that is, ■ We can therefore give an upper bound: 



{z - 6) 



Cn) 



< 



(16 



(13) 



5.2 Rough and less rough approximation in the general case 

Let P{z)/Q{z) be a rational function, with P{z) a polynomial of degree 
m Q{z) a monic polynomial of degree n. Let zeroes(Q) be the multiset of 
zeroes of Q (multiple zeroes are counted with their multiplicity). P{z) = 
EkPkz'^Qiz), thus ||P||i < Ek\Pk\-\\Q\\i- Therefore 



< 



|i 



n 



gGzeroes(Q) 



(lei 



(14) 



This is, however, a very coarse approximation. Intuitively, the mass of the 
convolution kernel expressed by the P/Q series lies in its initial terms. Still, 
with the above formula, we totally neglect the cancellations that happen in 
the computation of this initial part of the kernel; i.e. instead of considering 
|a-6|, we bound it by |a| + |6|. The solution is to split \\P/Q\\i into \\P/Q\\f^ 
and ||P/Q||f^. We shall elaborate on this in Sect. I5.5[ 



5.3 Second degree denominators with complex poles 

A common case for filtering applications is when the denominator is a second 
degree polynomial Q of negative discriminant. In this case, the roots of Q 
are two conjugate complex numbers ^ and ^ and the decomposition is as 
follows: 

Piz) „ , . A A 



Po{z) + 



+ 



Q{z) ■ z-i ■ z-i 

where A = P{^,)/{^ — S,). We shall for now leave Pq out. 
We are interested in the coefficients of this series: 



(15) 



A A 

+ 



(16) 
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Let us write A = |A|e*" and ^ = \S,\e^^; then 



|A| 



ia -i(k+l)f3 , -ia Jik+l)f3 



^cos(a-(A: + l)/?) (17) 



To summarize, the sequence is a decreasing exponential of rate 1/|.^| mod- 
ulated by a sine wave and multiplied by a constant factor |A|/|^|. There- 
fore, computing |A| and |^| will be of prime importance. If Q is monic 
Q{z) = z'^ + zix + zq, then = = cq- In the case of a rational function 
of the form 

P{z) _ ao + aiz + a2Z^ 

Q{z) ~ l-/3iz- /32^2 ^^^> 

then 1^1 = 1/32 1 and A = P{^,)/{^ ~ 0- Should we prefer not to compute 
with complex numbers, 

|A|'^AA=^«'«-g_+-y«-« (19) 

The numerator is a symmetric polynomial in ^ and ^, roots of Q, and there- 
fore can be expressed as a polynomial in the coefficients of Q ; its coefficients 
are polynomials in the coefficients of P, therefore the whole polynomial can 
be expressed as a polynomial in the coefficients of P and Q. The denominator 
is just the discriminant of Q. 

I . i2 _ "2^ + 132 (-ai^ - aa ai Pi + a^^ /Ja) -t- 0^2 (ai /3i + ao + 2 /Ja)) 

" " -(/3i^ + 4/32) ^''^ 

We are now interested in bounding |afc|. If we just use | cos {a — {k + 1)(5) \ 
< 1, we come back to the earlier bounds obtained by totally separating the 
series arising from the two poles. 

We shall now obtain a better bound using the following constatation: for 
any real 0, 



I cos^l = Vcos2 Q = ^(l + cos(20))/2 < 2'^/'^ (I + cos(20)/2) (21) 
using the concavity inequality ^Jl + x < \ + x/2. Therefore 

lafcl < ^^pi^l + ^°^(2(q - (/c + l)/3))/2) (22) 

Now, we are interested in bounding YlT=N['^k\- For any a and h, and < 
r < 1 

E,,, I. cos a — r cos(a — 6) , , 

cos(a + kh)r^ = ^ ^ 23 

k=0 
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Let us now see the quality of such bounds Si < S2, Si < S3: 

00 

Si = ^\cos{a + kb)\r'' (24) 

k=0 

00 -. 
k=0 

^ 00 

53 = ^^(l + 2cos(2(a + A:6)))r^ (26) 
fc=o 

o (J_ , i cos(2Q)-rcos(2(a-b)) \ 

V2V1-^ 2' l-2rcos6 + r2 ^ 



Note that ^3 is not necessarily better than 5*2 (for a = and b = 0, S3/ S2 = 
3/{2\/2) ~ 1.06). However, some moderate gains may be obtained; for 
instance, for r = 0.7, a = and 6 = 0.3, Si ~ 2.60, 52 ~ 3.33 and ^3 ~ 2.80. 
For practical purposes, the bound obtained using ^2 is very sufficient and 
easy to compute. We thus opt for this one. 

5.4 Finer bounds using partial fraction decomposition 

It is well known that if Qi are pairwise prime polynomials, and Q is their 
product, then for any polynomial P prime with Q the fraction P/Q admits 
a partial decomposition as P/Q = Pq + X^,, Pi/Qi, where Pq is the Euclidean 
quotient of P by Q and the degree of Pi is strictly less than that of Qi. 

Using the fundamental theorem of algebra, it follows that if the are 
the distinct roots of Q and rrii their multiplicity, then there exist Ajj E C 
such that 

W = Po + i:E(r^ (28) 

Since Q is a real polynomial, its roots are either real, either pairs of and 
conjugate .^j' = with the same multiplicity, and also for all j, Xi'j = Xi' ,j. 

However, while theoretically sound, this result is numerically delicate 
when there are multiple roots, or different roots very close to each other. [T^ 
§1.3] For instance, let us consider a first-degree polynomial P and a second- 
degree polynomial Q, then 

m = ^+^ (29) 

and we obtain Ai = ^(6)7(6 - 6) (and A2 = ^(6)7(6 - 6))- Both 
numbers will get very large, in inverse proportion of ^1 — ^2- While it is 
quite improbable that we should analyze filters where two separate poles 
have been intentionally be placed very close together, it is possible that we 
analyze filters with multiple poles (for instance, the composition of a filter 
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with itself), and, with numerical computations, we would have two extremely 
close poles and thus a dramatic numerical instability. 

We still can proceed with a radius r decomposition of P/Q [131 Def 1.3]: 
instead of factoring Q into a product of z — factors, we factor it into 
a product of Qi such that for any i, and any roots and ^2 of Qi, then 
ICi ~ '^2! < 2r. The same reference describes algorithms for performing such 
decompositions. We obtain a decomposition of the form 



P 



(30) 



where the roots of each Qi are close together, the degree of Pi is less than 
the degree of Pi. From this we obtain the bound 



< llPf 



oili 



1 



Qi 



(31) 



which we can bound using the inequalities given in the preceding subsections. 
We can, as before, improve on this bound by splitting the series between an 
initial sequence and a tail. 



5.5 Development of rational functions and normed bounds 

Let P{z)/Q{z) G Q[2](^) be a rational function representing a power series 
by its development («n)nGN around 0. We wish to bound ||ti||i, which we 
shall note ||P/Q||i. As we said before, most of the mass of the development 
of P/Q lies in its initial terms, whereas the "tail" of the series is negligible 
(but must be accounted for for reasons of soundness). We thus split P/Q 
into an initial development of N terms and a tail, and use 

||P/g||i = ||P/Q||<^ + ||P/Q||P (32) 

||P/(5||i is computed by computing explicitly the N first terms of the de- 
velopment of P/Q- We shall see in Sect. 18.31 the difficulties involved in 
performing such a computation soundly using interval arithmetics. 

Let dq be the degree of Q. The development D of P/Q yields an equation 
P{z) = D{z).Q{z) + R(z).z^. We have P{z)/Q{z) = D{z) + R{z)/Q{z).z^ 
and thus 

IIWIir = IIWIIi<Plloo.||l/Q||i (33) 

The preceding sub-sections give a variety of methods for bounding ||l/Q||i 
using the zeroes of Q{z); Section gives a rough method based on lower 
bounds on the absolute values of the zeroes of Q{z). ||-R||oo is bounded by 
explicit computation of R using interval arithmetics; as we shall see, we 
compute D until the sign of the terms is unknown — that is, when the norm 
of the developed signal is on the same order of magnitude as the numerical 
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error on it, which happens, experimentally, when the terms are small in 
absolute values. Therefore, ||-R||oo is small, and thus the roughness of the 
approximation used ||l/(5||i does not matter much in practice. 

6 Precision properties of fixed- or fioating-point op- 
erations 

In this section, we shall recall a few facts on the errors introduced by fixed- 
and floating-point arithmetics. They will be sufficient for all our reasonings, 
without need for further knowledge about numerical arithmetics. 

Most types of numerical arithmetics, including the widely used IEEE- 754 
floating-point arithmetic, implemented in hardware in all current microcom- 
puters, define the result of elementary operations as follows: if / is the ideal 
operation (addition, subtraction, multiplication, division etc.) over the real 
numbers and / is the corresponding floating-point operation, then f = r o f 
where r is a roundoff function. The roundoff function chooses a value r{x) 
that can be exactly represented in the used fixed- or floating-point data type, 
and is very close to x; specifically, most systems, including all IEEE-754 sys- 
tems, provide the following roundoff functions l^j 

• round to 0: r{x) is the representable real nearest to x in the direction 
of 0; 

• round to -|-oo: r{x) is the representable real nearest to x in the direction 
of -|-oo; 

• round to — oo: r{x) is the representable real nearest to x in the direction 
of — oo; 

• round to nearest (generally, the default mode): r{x) is the repre- 
sentable real nearest to x. 

In this description, we leave out the possible generation of special values 
such as infinities (-|-oo and — oo) and not-a-number (NaN), the latter indi- 
cating undefined results such as 0/0. We assume as a precondition to the 

^°On Intel x86 systems, the description of the exact properties of the floating-point 
arithmetics is complicated by the fact that, by default, with most operating systems and 
languages, the 80287-compatible floating-point unit performs computations internally us- 
ing 80-bit long double precision numbers, even when the compiled program suggests the 
use of standard 64-bit double precision IEEE numbers. Note that such usage of supple- 
mental precision for intermediate computations is allowed by the C standard, for example. 
The flnal result of the computation may therefore depend on the register scheduling and 
optimizations performed by the compiler. Since we reason by maximal errors, our bounds 
are always sound (albeit pessimistic) in the face of such complications, whatever the com- 
piler and the system do. 
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numerical filters that we analyze that they are not fed infinities or NaNs — in- 
deed, in some DSP (digital signal processor) implementations, the hardware 
is incapable of generating or using such values, and in many other implemen- 
tations the system is configured so that the generation or usage of infinities 
issues an exception resulting in bringing the system into a failure mode. Our 
framework provides constructive methods for bounding any floating-point 
quantity x inside the filters as ||x||oo < cq + ^^^^^ Cfc.||efc||oo where the are 
the input streams of the system; it is quite easy to check that the system 
does not overflow (||a:|| < M); one can even easily provide some very wide 
sufficient conditions on the input (||efc||oo < {M — co)/(X]fc=i Cfc))- ^^^^ 
not include such conditions in our description, for the sake of simplicity. 

For any arithmetic operation, the discrepancy between the ideal result x 
and the floating-point result x is bounded, in absolute value, by maxfsrpi £ahs) 
where Sabs is the absolute error (the least positive floating-point number r^l 
and Erei is the relative error incurred (Sabs = 2~^^'^^ ~ 4.94 • lO"'^^'* and 
Erei = 2~^^ ~ 1.11 • 10~^^ for IEEE double precision operations, for the 
worst case with respect to rounding modes). We actually take the coarser 
inequality 

\x- x\ < erei\x\ +eabs (34) 

See [9] for more details on floating-point numbers and [ITj for more about 
the affine bound on the error. 

In the case of fixed-point arithmetics, we have Srei = and Sabs = ^ {6 is 
the smallest positive fixed-point number) if the rounding mode is unknown 
(round to -|-oo, — oo etc.) and S/2 is it is the rounding mode is known to be 
round-to-nearest . 

7 Compositional semantics: fixed- and floating-point 

In this section, we give and a compositional abstract semantics of filters on 
the floating-point numbers. 

7.1 Constraint on the errors 

Our abstract semantics characterizes a fixed- or floating-point filter F by: 

• the exact semantics of the associated filter F over the real numbers 

• an abstraction of the discrepancy A(/) = F{I) — F{I) between the 
ideal and floating-point filters. 

^^The absolute error results from the underflow condition: a number close to is 
rounded to 0. Contrary to overflow (which generates infinities, or is configured to issue 
an exception), underflow is generally a benign condition. However, it precludes merely 
relying on relative error bounds if one wants to be sound. 
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We transform F{I) into the sum of a term that we can bound very 
accurately using algebra and complex analysis, and a nondeterministic input 
A(/) that we cannot analyze accurately and soundly without considerable 
difficulties, but for which bounds are available: assuming for the sake of 
simplicity a single input and a single output and no initialization conditions, 
we obtain an affine, almost linear constraint on the ||A(J)||oo with respect 
to ||I||oo: ||A(/)||oo < efeill^lU + ^abs' short: since the filter is linear, the 
magnitude of the error is (almost) linear. 

We generalize this idea to the case of multiple inputs and outputs. The 
abstract semantics characterizing A is given by matrices ^ G (M+) 

11° such that 



and e^j ^ 

r 



and a vector E 



\\F{I,R)-F{I,R)\\^ < ei,^T-NooiI) + 



(35) 



where F{I, R) is the output on the stream computed upon the floating-point 
numbers on input streams / and initial values /. 



7.2 Basic arithmetic blocks 

Plus node implemented in floating point type /: 

, £re\,D = 0, Sabs = £ 



1. T 



[1 1] 



-D = 0, ErehT 



J 



J 



^rel ^rel 



/ . 

abs' 



Scale by k node implemented in floating point type /: T = [A;] , D = 0, 

f f 

Delay without initializer (delay for n clock ticks): T = [z"'~\, D = 0, 

erel,T = 0, erel.D = 0, ^abs = 

Unit delay v^^ith initializer : T = [z], D = [l] , Erei.T = 0, erei,D = 0, 

^abs = 



Parallel composition block matrices and vectors: 



'rel,T 







'rel.T. 



=-1 









£abs 



'abs 
'abs. 



7.3 Serial composition 

The serial composition of two filters is more involved. Let F and G be 
the ideal linear transfer functions of both filters, and F and G the transfer 
functions implemented over floating-point numbers. 

We have V/ NooiF{I) - F{I)) < e^i.A^oo(/) + efbs {mutatis mutandis 
for G). We are interested in e = NQoiF{I) — F{I)): that is, a vector of 
positive numbers indexed by the outputs of the system such that on every 
coordinate fc, the difference 6 between output k computed over the reals and 
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the floating-point numbers over the same input I verifies \\6\\oo < £k- We 
extend < to real vectors coordinate-wise. 

The following is easier to understand when each filter has a single input 
and a single output; then, all vectors and matrices are scalars (either in M 
or Q[z](z), and iVx(w) is simply 

The vector R of (re)initialization values is split between (those con- 
cerning F) and R'^ (those concerning G). We split the overall output error 
of the system between the part that was introduced by the first filter (and 
then amplified or attenuated by the second filter) and the part that was 
introduced by the second filter, and use the triangle inequality: 

iVoc((G o F)(J, R)~{Go F){I, R)) 

<N^{Go F{I) -Go F{I)) + Noo{Go F{I) -Go F{I)) 
< Wi(G).(F(J) - +£Si,T.iVoo(F(/)) +ffSi,D.A^oc,(i?'') +eL 

< N-,(G).{F{I) ^ F{I)) + e?,i,T.iN^inn) + Noo{F{I) ^ F{m + egi,n.N^{R° 

< (iVl(G) +efei,T).iVoo(F(/) - F(J)) +efel,D.iVoo(i?'') +£?el-A^oo(F(/)) 

< {Ni{G) + egi,T)-{£feLT-N^iI) + ef,i,n-N^iR^) + e^bs) 
+ eSi.T.iVi (F).Afoo (J) + eSi.D.iVoo (i?^) + £abs 
< [(iVi(G)-l-efei.T).efei.T + efei,T.iVi(F)] .N^il) 

+ [(iVi(G) +£Si).£fei,D] .iVoo(i?^) + [efeuo] ■N^iR^') 

+ [(iVi(G)+£?ei).efbs+£fbs] (36) 

Thus e^o? = {N,{G)+efj4^,+eg,.m{F), 

7.4 Feedback loops 

Let us call o^") the vector of outputs of the filter at step n. It is, ideally, a 
linear function of the current input, the preceding inputs, and the preceding 
outputs. On = L{I<n,0<^]\f)- Let us call L the associated floating-point 
function and O the floating-point output of the filter. Let us call A = — 0. 

An = L{I<n, O^n) - L{I<n, O^n) 

= L{I<n, O^n) — L{I<n, O^n)) + L{I<n, 0<Ar) — L(/<„, 0<Ar) 

= (l(/<„, 0<jv) - L{I<n, O^n))) + ^(0, A<;v) (37) 

Let Cn = L{I<n, 0<Af) — L{I<n, 0<Ar)) be the sequence of vectors of "error 
creations" at each iteration. Then A verifies the equation A = C + zTq.A. 
As before, this means A = (Id^ — zTq)~^.C and thus that A''oo(A<„) < 

m{{ldn-zT^)-').N^{C<n). 

Let us split e^j,^ G Mn,n+m{^+) into e^jj e Mn,mO^+) and e^j ^ e 
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Mn,n{^+). Then 



iVoo(C<„) < 4i,,.iVoo(/<Jv) + 4i,o-A^oo(0<^) + 4i,o.A^oo(i?) + efbs 

^ £rel,/-^oo(-^<Ar) + £rel,0-^oo(0<Ar) + grel,0-^oo( pA^ -^0<n) 

A<jv 

+ ^rel,D -^oo + efbs (38) 

But then, noting A = Ni ((Id„ - zT^)^^), 

A^oo(A<„) < A.(ef,i^j.iVoo(/<^) + efei,o-^i(r)-^oo(/<7v) 

+ 4i,o-^^oc(A<;v) + efei,B-A^oo(i2) + (39) 

Let Ki = ^.e^i,o ^ Mn,nO^+) and 

i^2('^,p) = A (ef^i^, +4i^o-^^i(r))-^ + 4i,D-P + ^L (40) 

Then iVoo(A<„) < A'i.Afoo(A<.„)+K2(iVoo(i'), A^oo(^))- This means that the 
sequence Un = A'^oo(A<„) verifies uq = and < Ki.Un+K2{Noo{I), Noo{R)). 
This imphes that for all n, ti„ is less than the least fixed point L of f i— > 

Ki.v + K2{N^{I),NM)- 

Recall that the spectral radius of a matrix M of real numbers is the 

greatest absolute values of its eigenvalues. If Ki is contracting (spectral 

radius less than 1), then v i— > Ki.v + K2{Noo{I), Noo{R)) has a unique 

fixed point, by Banach's fixed point theorem; and 1 — Ki is invertible. 

This fixed point is u = (1 - Ki)~^ K2{NooiI), NodR)). Let e^i.T = (1 - 

K.rKA. (efei,/ + 4i,o-^i(^))> ^rei,D = (1 " Ki)-\ef^^ ,„ and e^bs = (1 - 

iCi)-i.^.efbs- Then A^oo(A) < ere!,T.iVoo(/) + erel,D.A^oo(i?) + ^abs- 

Recall that Ki = A.ef^^Q G M.n,n{^+) where A is the matrix of norms 
A''! ((Id„ — zTq )~^) ; Ki bounds the amount of floating-point imprecision 
that feeds back into the system. A is the amplification bounding matrix 
of the filter consisting merely of the feedback loop of the original filter; if 
the original filter is stable and well-designed, the coefficients of A should be 
moderate, e^j q measures the creation of imprecision in one iteration of the 
internal filter; if the filter is numerically well-designed, then its coefficients 
are very small. On real-world examples, Ki was on the order of magnitude 
of 10-1^ 

This suggests an effective method for bounding from above the various 
quantities of the form {1 — Ki)~^.y that we listed, where y is a column vector 
(if y is a matrix, then split it into its column vectors). 

oo 

d^ = {l-Ki)-\y = Y,Kly (41) 

A;=0 
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is the unique fixpoint of (p = x >—>■ Ki.x + y, which is monotonic and con- 
tracting. Consider the matrix norm subordinate to || • ||oo on vectors: 

\\Ki\\ =sup^ kuj (42) 

i 

J 

This gives a rough bound on doo- 

oo 11 11 

Moolloo < Yl Il^lf-ll2/lloo = Y^YTl- ^^^^ 
fe=0 " ^" 

Let dn = {x^ Ki.x + = doc - dn = Kl+\doo, thus 

\\doo '^nlloo !^ ~. 1 1 1 1 • 1 1 y 1 1 oo • (44) 

Therefore, the following is an upper bound on doo'- 

^ = ^n+f- ||^^^|_.^ -||j/||ooj .Vl (45) 

where Vi is a vector of ones of the same dimension as y. This computation 
may be effectively performed in floating-point arithmetic in order to yield 
a sound upper bound by computing Eqn. W2\ and in round-to--|-C!0 mode 
(x I— > — 1/x is monotonic). Remark that we can directly prove the soundness 
of the resulting B by checking that Ki.B + y is less than B (this check- 
ing phase, though unnecessary assuming a sound implementation, may be 
cheaply performed for the sake of security; while it is possible that the result 
should be correct and the check fails, this seems very unlikely in practice, 
and can be worked around by choosing a slightly larger B). 



7.5 Trading some accuracy for computation speed; nonlinear 
elements 

We have split the behavior of the filter into the sum of the convolution of 
the input signal by the power development of a rational function, represent- 
ing the exact behavior, and some error term. If we compute the rational 
functions exactly over then the rational coefficients might grow ex- 

pensively large. It seems silly to use high precision for the coefficients of 
a system parameterized by floating-point numbers and implemented with 
floating-point errors. Indeed, we may reduce the precision of the coefficients 
of the rational function at the expense of adding to the margin of error. 
An ideal filter of Z-transform the rational function P{z) / {1 — Q{z)) where 

P{^) = EkloPk^'" and Q{z) = 

QkZ with non initialization condition 
is equivalent to a filter with ideal input Z-transform P and ideal feedback 
Z-transform Q (Fig [4]). Such a filter may be soundly approximated by a 
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delay 



Q 



Figure 4: An ideal filter equivalent to a filter of Z-transform P{z)/{1 — Q{z)). 



non-ideal feedback filter F» with Tf " = P», T^" = Q», e^ei,/ = - P||i, 
^rei,/ = IIQ^ ~ Qlli) Sabs = 0, which we know how to solve from Sect. I7.4[ 

More generally: a filter F may be approximated by a filter with 
transfer function T^" = , efj_y = efeLT + ^?e\,T^ 4ld = ^fe\,D + ^rei,D> 
^fbs ~ ^abs ■^here G is the feedback filter with internal filter H given Tf = 

pK = qK s^i,, = - ^lli, s^i,, = - Qlli, efbs = 0. 

Note that this gives a generic method for approximating non-linear el- 
ements occuring in filters, provided that it is possible to split them into a 
linear part and a nonlinear part, the output of which can be bounded by an 
affine function of bounds on the absolute value of the inputs. 

8 Numerical considerations 

We have so far given many mathematical formulas that are exact in the real 
field. In this section, we explain how to obtain sound abstractions for these 
formulas using floating-point arithmetics. 

8.1 Interval arithmetics 

IEEE floating-point arithmetics [9] and good extended precision libraries 
such as MPFR [7] provide functions computing upward rounded (or rounded- 
to-+oo) and downward rounded (or rounded-to—oo) results: that is, if /(xi, . . 
is the exact operation on real numbers and /~ and /"'" are the associated 
floating-point downward and upward operations, then f{xi, . . . is guar- 
anteed to be in the interval [f~{xi, . . . , Xn), f~^{xi, . . . , Xn)], which will guar- 
antee the soundness of our approach. Furthermore, for many operations, 
f~{xi, . . . , Xn) and f~^{xi, . . . , x„) are guaranteed to be optimal; that is, no 
better bounds can be provided within the desired floating-point format; this 
will guarantee optimality of certain of our elementary operations. 
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8.2 Approximate algebraic computations 

In many occasions, we ideally would like to compute on real polynomi- 
als P = Ylk=iPkz'' but instead we compute on floating-point polynomials 
P = Ylk=ii^k, hklz'' abstracting the set 'y{P) of polynomials P such that 
V/c pk G [h,hk]. In practice, it will often be necessary that ^ [lk,hk] 
in order to avoid uncertainties on the degree of the polynomial. All the 
usual polynomial operations (addition, multiplication by a scalar, subtrac- 
tion, multiplication) may be abstracted using interval arithmetics. We also 
include a test containso(-P) whether the null polynomial is in 7(-P). We call 
this structure an abstract ring. 

Given a abstract ring i?, we construct the abstract field of fractions over 
that ring using the following operations: Pi/qi+P2/q2 = {PiQ2+P2qi)/{qiq2), 
k.{p/q) = {k.p)/q, {pi/qi).{p2/q2) = {pi-P2)/{qi-q2), (Pi/^O/feM) = {pi-q2) 
/{qi-P2), containso(p/(7) = containso(p). We can make a simple attempt at 
reducing the fractions by checking that there are no trivial cancellations 
between the numerator and denominator in products and quotients. 

Given an abstract ring K, we construct the abstract ring of matrices 
over that ring with the usual operations: if M = A + rriij = aij + bij; 
if M = A.B, rriij = '^k^i,k-bk,j- If K is an abstract field, we can also 
implement Gaussian elimination in order to compute A~^.B given a square 
matrix A and a matrix B. When we look for a pivot, we select elements e 
such that containso(e) is false. 

Unfortunately, computations on such approximate structures may yield 
unfavorable results. In particular, the absence of simplification between the 
numerator and denominator may yield fractions P{z)/Q{z) where P and Q 
have some common zeroes. The spurious poles that are introduced not be 
that much of a problem if we use partial fraction decomposition (Sect. 15. 4|) . 
for they will yield very small coefficients in the decomposition; however, they 
will make the computations more complex. If using the simple tail bounds 
of Sect. 15.11 the results may be considerably worse. 

A solution is to perform all computations on rational functions exactly 
over Q[2;](^). Then, cancellation between a numerator and a denominator can 
be performed exactly by division by their greatest common divisor, which is 
obtained from Euclide's algorithm over the Euclidean division of polynomi- 
als. No spurious poles may be introduced. However, on large filter networks, 
exact computations may produce exceedingly large integer numerators and 
denominators. It is then possible to apply the approximation scheme of 
Sect. 17.51 in order to trade speed for potential precision. This is the solution 
that we implemented in our system: exact computations on rational num- 
bers and safe approximations to limit the length of the numbers involved in 
the computations. 
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8.3 Computation of developments 

When bounding the norm HP/QUi of a series quotient of two polynomials, 
we split the series into its N initial terms of development, which we compute 
explicitly, and a tail whose norm we bound. The first idea is to compute the 
first terms of the series by quotienting the series, as explained in Sect. I3.ll 
or, equivalently, by running the filter for iterations on the Dirac input 
1, 0, 0, . . .. In order to provide a sound result, one would work using interval 
arithmetics over floating-point numbers. However, as already noted by Feret, 
after some number of iterations the sign of the terms becomes unknown and 
then the magnitude of the terms increase fast; it is therefore indicated to 
compute the development until the first term of unknown sign is reached, 
and assign N accordingly (one may still also enforce a maximal number of 
iterations A^max)- In order to be able to develop the quotient further with 
good precision, one can use a library of extended-precision floating-point 
computations with selectable rounding direction, such as the MPFR library 
now part of GNU MP [7]. 

8.4 Bounding the roots 

In order to bound ||P/Q||i, where P and Q may possibly be given using 
interval coefficients, we have to bound the roots of Q. More formally, we 
have to solve the following problem: given an interval polynomial P{z) = 
Y,k=i[^k; hk]z'' such that ^ [ln,hn\, find a family {ik,Pk)i<k<n (Cfc G C 
with and floating-point numbers, pk £ a floating-point number) 
such that for any polynomial P = Y12=iPk such that VA; pk S [lk,hk], then, 
up to a permutation, the n roots ('^fc)i<fc<n of P are such that ^k £ p^) 
where D{z,r) is the closed disc of center z and radius r. 

Often, what we need is actually bounds on the this can easily be 
obtained from the preceding bounds using interval arithmetic on plus, minus, 
multiply and square root. 

Our coefficients are intervals [lk,hk] in order to accommodate possible 
errors of floating-point computations. As a consequence, it is expected that 
hk — Ik are small. This suggests to us a two-step method for obtaining the 
desired bounds: 

1. Use an efficient and, in practice, very accurate algorithm to obtain 
approximations Xj to the roots of Ylk=i ^-^^^z^ (the "midpoint poly- 
nomial") . 

2. Prom those approximations, obtain bounds on the radius of the error 
committed. 

There exist a variety of methods and implementations to perform the first 
point. We used gsl_poly_complex_solve of the GNU Scientific Library [8], 
which is based on an eigenvalue decomposition of the companion matrix. 



25 



For the second step, Rump describes a variety of bounding methods [22] 
which take a polynomial and approximate roots as an input and output 
error radii; these methods may be performed using interval arithmetics. We 
implemented the simplest and roughest one: is in a closed disc of center 
Xj — pj and radius \pj \ where 

nP(xi) , , 

Pi = rr ' (46) 

which is easily implemented using interval arithmetics [P becomes P etc.). 



9 Implementation and case studies 

We implemented the algorithms described above in a simple Objective Caml 
[TS] program: filters are represented by a record of all their characteristics 
(transfer matrices, bounding matrices); functions (in the OCaml) sense con- 
struct filter records, or perform composition operations. 

The formal computations on fractions are performed over Q, implemented 
using GNU MP's mpq type [7]. We initially considered using MPFR [10], an 
extended precision library with sound rounding modes, for interval computa- 
tions; instead, we simply use the IEEE- 754 rounding modes of the hardware 
floating-point unit, which is much faster. 



9.1 Composition of TF2 filters 

Let us recall the example of Sect. 14.51 It is a composition of two TF2 filters 
with a feedback loop around it. The serial composition of the filter in Fig. [3] 
and another TF2 filter, all with realistic coefficients, is analyzed in about 
0.10 s on a recent PC; the analyzer finds that \\S\\ < g\\E\\ with g ^ 2, with 

e,el - 10-12 and Eabs - IQ-^OS. 

The power series developments of rational functions (Sect. [8311 are done 
up to around the 27th order. 



9.2 Complex nonlinear iterated filter 

We now consider a nonlinear, iterated filter due to Roozbehani et al. |21][§5]. 
We first analyze separately f ilterl () (2nd-order linear filter) and f ilter2 () 
(2nd-order afhne filter). So as to simplify matters, we do not give the trans- 
fer functions using matrices, matrices inverses etc. but as the solution of 
a system of linear equations over polynomials in z. We obtain that system 
very simply from the program: whenever we see an assignment x := e, we 
turn it into an equation x = e (we assume without loss of generalities that 
variables are only assigned once in a single iteration step), where e is the 
original expression where a variable v that has not yet been assigned in the 
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current iteration is replaced by iy + z.v, standing for the initialization 
value of V. 



void filterl () { 

static float E[2], S [2] ; 
if (INITl) { 

S[0] = X; P = X; 
E[0] = X; E[1]=0; S[1]=0; 
} else { 

P =0.5*X-0.7*E[0] +0.4*E[1] 
+1.5*S[0]-S[1]*0.7; 

E[l] = E[0] ; 
E[0] = X; 
S[l] = S[0] ; 
S[0] = P; 
X=P/6+S[l]/5; 



p = 0.5e — 0.7(ieo + z.eo) 
+0.4(tei +z.ei) 

+ 1.5(iso + z.sq) - 0.7(isi + z.si) 
ei = ieo + z.eo 
eo = e 

si = + z.ei 
X ~ p/Q + si/5 



We call e the input value for X. We solve the system and obtain x = 
Q-e + Qi^^ .ieo + Qie^ -^ei + Qiso -^so + Qisi -^si ■ The common denominator of 
the Q fractions is 10 — 15z + 7z'^, which has complex conjugate roots z such 
that l^l ~ 1.2. = = and ieo — ^so = (the last value for input 
e such that INITl is true), thus ||a;||oo < ||Q||i-||e||oo + WQieo + QisJloo-Wi^W- 
With a precondition ||e||oo < 400, this yields || 

■^lloo ^ 339. If we take the 
coarser inequality ||a;||oo < ||<5||i-||e||oo + (HQieolloo + IIQj.J|oo)-IN| we get 
||x||oo < 528. Roozbehani et al. find a bound ~ 531. 

void filter2 () { 

static float E2 [2] , S2[2]; 
if (INIT2) { 

S2[0] =0.5*X; P = X; 

E2[0] = 0.8*X; E2[l]=0; S2[l]=0; 
} else { 

P =0.3*X-E2[0]*0.2+E2[1]*1.4 p = 0.3e-0.2{ie„ + z.eo) 
+S2[0]*0.5-S2[1]*1.7; +lA{i,, + z.ei) 

+0.5(iso + z.sq) + 1.7(isi + z.si) 
E2[l] =0.5*E2[0]; d = 0.5(«eo + z.eo) 

E2[0] = 2*X; eo = 2e 

S2[l] = S2[0]+10; Si=iso+ z.sq + t 

S2[0] = P/2+S2[l]/3; So=p/2 + si/3 

X=P/8+S2[l]/10; X ^ p/8 + si/10 



We proceed similarity (with the introduction of r = 10/(1 — z) and 
obtain x = Q.e + Qi^^ .ieo + Qi^^ -iei + Qi,^ -iso + Qis^ ■'^si + Qc- The common 
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denominator of the Q is 60 + 35z + 51z^, with complex conjugate roots z 
such that \z\ ~ 1.08. Then ||x||oo < ||<3|| 1 ■ II ^11 oo + ||0.8Qi,^+0.5Q,,J|oo.|k|| + 
IIQdloo- This yields ||x||oo < 1105. 

The two linear filters are combined into an iterated nonlinear filter. 
filterlO (resp. filter2()) is run with a pre-condition of X € [-400,400] 
(resp. [—800,800]). We replace the call to the filter by its postcondition 
XG [-339,339] (resp. XG [-1105,1105]). 
void main () { 
X = 0; 

INITl = TRUE; INIT2=TRUE; 
while (TRUE) { 

X = 0.98 * X + 85; 
if (abs(X)<= 400) { 
filterl 0; 
X=X+100; 
INIT1=FALSE; 
} else 

if (abs(X)<=800) { 
f ilter2() ; 
X=X-50; 
INIT2=FALSE; 

> 

}} 

The program then can be abstracted into: 
while (TRUE) { 

X = 0.98 * X + 85; 

maybe choose X in [—1155,1055]; 

} 

We obtain X G [—1155,4250.02] by running Astree with a large number 
of narrowing iterations, whereas Astree cannot analyze the original program 
precisely and cannot bound X. In this case, the exact solution [—1155,4250] 
{x = 0.98x + 85 has for unique solution x = 4250) could have been computed 
algebraically, but in more complex filters this would not have been the case. 
Roozbehani et al. have a bound of 4560. 

Note that the non-abstracted program converges to a value ~ 205, with 
X G [0,209]. However, this very simple program illustrates our methodology 
for compositional analysis: finding the optimal solution is possible here be- 
cause the program is simple, but would not be possible in practice if we had 
added more nonlinear behavior and nondeterministic inputs, as in real-life 
reactive code; whereas by analyzing precisely each linear filter and plugging 
the results back into a generic analyzer, we get reasonable results. 
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10 Related works 



In the field of digital signal processing, some sizable literature has been de- 
voted to the study of the effects of fixed-point and floating-point errors on 
numerical filters. In the area of fixed-point computation, bounds on the sizes 
of the various operands are of paramount importance: operands that leave 
the prescribed range will undergo saturation and the output signal will be 
distorted. For these reasons, operands are scaled so as not to produce digital 
saturation; yet, the scale factor should be made large enough that rounding 
errors are very small compared to the typical magnitude of the signal. While 
the fact that the U-norm of the convolution kernel is what matters for judg- 
ing overflow, it is argued that this norm is "overly pessimistic" [121 §11-3] 
[TTl eq 13], not to mention the difficulties in estimating it. In practice, filter 
designers have preferred criteria that indicate no saturation for most "com- 
monplace" inputs, excluding pathological inputs. Our vision is different: our 
results must be sound in all circumstances, even pathological inputs. 

The impact of fixed- and floating-point errors in digital filters was clas- 
sically studied from by modeling the errors as random sources of known 
distribution, independent of each other and with no temporal correlation 
(i.e. correlations between successive values) [3l[20]. These assumptions are, 
in reality, false: the computational process is fully deterministic, and not 
random; the computations are generally interdependent (all computations 
inside a filter depend on the past of the input variables); and there are tem- 
poral correlations. However, circuit designers are concerned with the spectral 
distribution of output noise and optimization of hardware or software 
implementations with respect to this noise, and these tools are adequate 
for this. On the other hand, we merely aim at providing sound bounds for 
the outputs of the system, but the bounds that we provide must be sound 
without any extra and unfounded suppositions. 

J. Feret has proposed an abstract domain for analyzing programs com- 
prising digital linear filters [6]. He provides effective bounds for first and 
second degree filters. In comparison, we consider more complex filter net- 
works, in a compositional fashion; but we analyze specifications, and not C 
code (which is usually compiled from those specifications, with considerable 
loss of structure). Another difference is that we do not perform abstract 
iterations. Feret's method currently considers only second-order filters (i.e. 
TF2), though it may be possible to adapt it to higher-order filters. On 
second-order filters, the bounds computed by Feret's method and the method 
in this paper are very close (since both are based on a development of the 
convolution kernel, though they use different methods of tail estimation). 

Lamb et al. [H] have proposed effective methods, based on linear algebra, 
for computing equivalent filters for DSP optimization. They do not compute 
bounds, nor do they study floating-point errors. 

Roozbehani et al. [21] find program invariants by Lagrangian relaxation 
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and semidefinite programming, with quadratic invariants. In order to make 
problems tractable, they too apply a blockwise abstraction. The class of pro- 
grams that they may analyze directly is potentially larger, but the results 
are less precise than our method on some linear filters. They do not han- 
dle floating-point imprecisions (though this can perhaps be added to their 
framework) . 

One possible application of our method would be to integrate it as a 
pre-analysis pass of a tool such as Astree [5]. Astree computes bounds on all 
floating-point variables inside the analyzed program, in order to prove the 
absence of errors such as overflow. In order to do so, it needs to compute 
reasonably accurate bounds on the behavior of linear filters. A typical fly- 
by-wire controller contains dozens of TF2 filters, some of which may be 
integrated into more complex feedback loops; in some cases, separate analysis 
of the filters may yield too coarse bounds. 

11 Conclusions and future works 

We have proposed effective methods for providing sound bounds on the out- 
come of complex linear filters from their flow-diagram specifications, as found 
in many applications. Computation times are modest; furthermore, the na- 
ture of the results of the analysis may be used for modular analyses — the 
analysis results of a sub-filter can be stored and never be recomputed until 
the sub-filter changes. 

The usefulness of these methods is twofold. First, they could be directly 
implemented in the graphical user interface for designing circuits. Users may 
then be able to compute gains or to check the stability of filters, taking into 
account floating-point errors (which conventional Z-transform techniques do 
not consider). Second, they can be used as a way to automatically obtain 
static analysis "transformers" or "transfer functions": a static analysis tool 
such as Astree may detect that some program sequence implements such or 
such complex linear filter, and apply some invariant relation computed using 
the techniques in that paper. 

In future works, we will examine the case of non- linear filters and com- 
positional, modular analysis. The analysis of a combination of linear and 
non-linear filters can be done in two ways or a combination thereof: 

• the overall behavior of a nonlinear filter may be constrained by some 
input-output relationship such as ||0||oo ^ (l + e)||-^||oo (example of a 
rate limiter), and this input-output relationship can be integrated into 
the abstract semantics as in Part [7l 

• the overall behavior of a linear filter can be precisely bounded, and 
this bound information can be fed into an analysis of a larger nonlinear 
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filter, such as one based on statically computed relationships between 
intervals [T9] 
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For any matrix M, let us note minors j (M) the determinant of the matrix 
obtained by removing line i and column j from M. We recall that for any 
matrix M of dimension n 

n 

det(M) = J](-l)"-^mij.minorij(M) (47) 
i=i 

and that the determinant is n-linear. Recall that for any matrix M of in- 
vertible determinant, 

M-^ = det{M)-\ [minorij(Af)]* (48) 
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Lemma 3. If A £ ■Mn,niQ[z]{z)) , then there exists B G Q[^](z) such that 
det(/4 - zA) = I - zB . 

Proof. Proof by induction on n. The case n = 1 is trivial. Now let us 
consider n> 1. 

det(Id„ - zA) 

n 

= (1 — zai^i)minori^i(Id„ — zA) + ^^(— l)"zaijminorij(Id„ — zA) 

i=2 

n 

= minori^i(Idn — zA) + z ( — l)"'2:aijminorij(Idn — zA) (49) 

The result follows by the application of the induction hypothesis, and the 
fact that B<Q[z](^z) is a ring and thus the determinant of any matrix over that 
ring is itself in the ring. □ 

Corollary 4. If A £ Ain,niQ[z](^z))> then Idn — zA) has an inverse in 

Proof. By the preceding lemma, det{ldn — zA) is of the form l — zP{z)/Q{z), 
where P and Q are polynomials such that the constant coefficient of Q is 
1, therefore (det(Id„ - zA))'^ = Q{z)/{Q{z) - zP{z) is in Q[z\z)- All the 
minorjj(Id,i — zA) are elements of Q[2](2), the result follows by applying 
Equ. [ni □ 
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